Abstract

In this paper we present a fully trainable binarization solution for degraded document images. Unlike
previous attempts that often used simple features with a series of pre- and post-processing steps, our
solution encodes all heuristics about whether or not a pixel is foreground text into a high-dimensional
feature vector and learns a more complicated decision function. In particular, we prepare features of
three types: 1) existing features for binarization such as intensity [1], contrast [2], [3], and Laplacian
[4], [5]; 2) reformulated features from existing binarization decision functions such as those in [6] and
[7]; and 3) our newly developed features, namely the Logarithm Intensity Percentile (LIP) and the Relative
Darkness Index (RDI). Our initial experimental results show that using only selected samples (about 1.5% of
all available training data), we can achieve binarization performance comparable to that of fine-tuned
(typically by hand) state-of-the-art methods. Additionally, the trained document binarization classifier
shows good generalization capabilities on out-of-domain data.

I. Introduction

As one of the most fundamental preprocessing methods in various document analysis work [8]-[17],
document binarization aims to convert a color or grayscale document image into a monochrome image,
where all text pixels of interest are marked in black on a white background. Mathematically, given a
document image $D = [D_{i,j}]$ of size $W \times H$, image binarization assigns each pixel $D_{i,j}$ a
binary class label $B_{i,j}$ according to a decision function $f_{\mathrm{binarize}}(\cdot)$ in a
meaningful way, namely

$$B_{i,j} = \begin{cases} \text{foreground class } 1, & \text{if } f_{\mathrm{binarize}}(D_{i,j}) < 0 \\ \text{background class } 0, & \text{else} \end{cases} \tag{1}$$

A successful document binarization process discards irrelevant and noisy information while preserving
meaningful information in the binary image $B = [B_{i,j}]$. This process reduces the space needed to
represent a document image, and largely reduces the complexity of advanced document analysis tasks.

Although humans do not often face many difficulties in identifying text even in some low-quality
document images, the document image binarization problem is indeed subjective and ill-posed, and
it involves many different challenges and combinations of challenges. For example, several of the
well-known ones are 1) how to handle document degradations like ink blobs, faded text, etc.; 2) how to
deal with uneven lighting; and 3) how to differentiate bleed-through text from normal text. In such
difficult scenarios, humans actually use high-level knowledge that might not be easily captured by
low-level features, such as a script character set and background texture analysis, to help decide which
pixel is foreground text.

Classic solutions more or less seek heuristic thresholds in simple feature spaces. They can be further
grouped into the so-called global thresholding and local thresholding methods [14] according to whether
the threshold is location independent or not. For example, Otsu's method [1] binarizes a pixel $D_{i,j}$ by
comparing its pixel intensity $I_{i,j}$ to an optimal global threshold $G_{th}$ derived from the intensity
histogram [1], as shown in (2).

$$f_{\mathrm{Otsu}}(i,j) = I_{i,j} - G_{th} \tag{2}$$

In contrast, Niblack's method [6] uses the decision function (3),

$$f_{\mathrm{Niblack}}(i,j) = I_{i,j} - \mu^{R}_{i,j} - k_{\mathrm{Niblack}}\,\sigma^{R}_{i,j} \tag{3}$$

where $k_{\mathrm{Niblack}}$ is a parameter below 0, and $\mu^{R}_{i,j}$ and $\sigma^{R}_{i,j}$ denote the
mean and standard deviation of pixel intensities within a region $R$ of size $w \times h$. Although
heuristic solutions are very efficient (they may require only a constant number of operations per pixel)
and work fairly well on many well-conditioned document images, it is clear that simple features and
decision functions are insufficient for handling difficult cases.
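The two classic decision functions (2) and (3) can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the function names are hypothetical, the Otsu routine is the textbook between-class-variance search, and the default $k_{\mathrm{Niblack}} = -0.2$ is a commonly used value that is assumed here, not taken from this paper.

```python
from statistics import mean, pstdev


def otsu_threshold(pixels, levels=256):
    """Global threshold G_th maximizing the between-class variance.

    Pixels with intensity < G_th form the dark class, matching f_Otsu < 0.
    """
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(g * h for g, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(1, levels):
        # After this update the dark class holds all pixels with value < t.
        n_dark += hist[t - 1]
        sum_dark += (t - 1) * hist[t - 1]
        n_bright = total - n_dark
        if n_dark == 0 or n_bright == 0:
            continue
        mu_dark = sum_dark / n_dark
        mu_bright = (sum_all - sum_dark) / n_bright
        var_between = n_dark * n_bright * (mu_dark - mu_bright) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def f_otsu(intensity, g_th):
    """Eq. (2): negative values mean foreground."""
    return intensity - g_th


def f_niblack(intensity, window, k=-0.2):
    """Eq. (3) on the intensities of a local window R (k = -0.2 is assumed)."""
    return intensity - mean(window) - k * pstdev(window)
```

In both cases a pixel is labeled foreground when the returned value is negative, following (1).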

To achieve robust document binarization, many efforts are being made in the areas of 1) image
normalization/adaptation, 2) discriminative feature spaces, and 3) more complicated decision functions.
For example, Lu et al. propose a local thresholding approach that mainly relies on background estimation
and stroke estimation. Su et al. [2], [3] find that Otsu's thresholding attains more discriminative
power in a local contrast feature space. Sauvola et al. [7] add the parameter $S_{\mathrm{Sauvola}}$ to
allow a non-linear decision plane (4).

$$f_{\mathrm{Sauvola}}(i,j) = I_{i,j} - \mu^{R}_{i,j} - k_{\mathrm{Sauvola}}\,\mu^{R}_{i,j}\left(\sigma^{R}_{i,j}/S_{\mathrm{Sauvola}} - 1\right) \tag{4}$$

Although many of these attempts work well when method assumptions are satisfied and method parameters
are appropriate, adapting a heuristic binarization method to a new domain is often not easy. Indeed,
Lazzara et al. [18] show that the original Sauvola method might fail even for well-scanned document
images because of text fonts of different sizes.

Unsupervised learning has recently dominated the document binarization area. In [19], a document image is
first clustered into three classes, namely foreground, background and uncertain, and pixels in the uncertain
class are further classified into either the foreground or background class according to their distances
from these two classes. In [4], [5], an image is first transformed into a Laplacian feature space, and a
global energy function is constructed to ensure that the resulting binary labels are optimal in the sense
of a predefined Markov random field. In [20], an unsupervised ensemble-of-experts framework is used to
combine multiple binarization candidates. Although these methods do not require a training stage, some
rely on theoretical models or heuristic rules whose assumptions may not necessarily be satisfied, and some
require expensive iterative tuning and optimization; it is thus no surprise that they are not reliable for
certain types of degradations [21].

Although image binarization is clearly a classification problem, supervised learning-based binarization 
solutions are still rare in the community. In this letter we discuss our initial attempts to solve the
document image binarization problem using supervised learning. The remainder of our paper is organized 
as follows: Section II overviews our solution and discusses all used features. Section III provides 
implementation details related to training and testing. Section IV shows our experimental results, and 
Section V concludes this paper. 


II. Feature Engineering 

Our goal is to develop a generic solution without preset parameters and pre- or post-processing.
Specifically, we are interested in learning a decision function $f_{\mathrm{ours}}(\cdot)$ that maps an
$n$-dimensional feature vector $X_{i,j}$ extracted around a pixel $D_{i,j}$ to a binary space $\{0,1\}$
in a meaningful way, i.e.

$$B_{i,j} = f_{\mathrm{ours}}(X_{i,j}) \tag{5}$$


Detailed feature engineering discussions are given below. 

1) Existing Features: Since a number of simple tasks can be accomplished just by applying Otsu's
method, we include a pixel's intensity $I_{i,j}$ and its deviation from Otsu's threshold as features:

$$X^{\mathrm{Local\ int.}}_{i,j} = I_{i,j} \tag{6}$$

$$X^{\mathrm{Otsu\ diff.}}_{i,j} = I_{i,j} - G_{th} \tag{7}$$


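In addition, we also use the local statistics of Eqs. (3) and (4), but with respect to different scales, i.e.,

$$X^{\mathrm{Local\ avg.}|R}_{i,j} = \mu^{R}_{i,j} = \frac{1}{wh}\sum_{p=-w/2}^{w/2}\ \sum_{q=-h/2}^{h/2} I_{i+p,\,j+q} \tag{8}$$

$$X^{\mathrm{Local\ std.}|R}_{i,j} = \sigma^{R}_{i,j} = \sqrt{\frac{1}{wh}\sum_{p=-w/2}^{w/2}\ \sum_{q=-h/2}^{h/2}\left(I_{i+p,\,j+q} - \mu^{R}_{i,j}\right)^{2}} \tag{9}$$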

where we make the size $w = h = ks$ of the local window $R$ depend on the scale $k \in \{1, 2, 4, 8\}$, and
estimate the stroke width $s$ using Su's method [3]. Inspired by the success of the Su [2], [3] and Howe
[4], [5] methods, we include their contrast and Laplacian features shown in (10) and (11).


$$X^{\mathrm{Su}|R}_{i,j} = \frac{\max_{p,q \in R}\{I_{i+p,\,j+q}\} - \min_{p,q \in R}\{I_{i+p,\,j+q}\}}{\max_{p,q \in R}\{I_{i+p,\,j+q}\} + \min_{p,q \in R}\{I_{i+p,\,j+q}\} + \epsilon_{\mathrm{Su}}} \tag{10}$$

(11)
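As an illustration of the local statistics (8), (9) and the contrast feature (10), the sketch below computes all three for one pixel at several scales. It is a simplified reading under stated assumptions: the helper names are hypothetical, windows are clipped at image borders, a single scale list is shared by all three features for brevity, and the value of $\epsilon_{\mathrm{Su}}$ is assumed. The Laplacian feature of (11) is not covered.

```python
from statistics import mean, pstdev


def window(img, i, j, half):
    """Intensities in the square of side 2*half+1 around (i, j), clipped at borders."""
    rows, cols = len(img), len(img[0])
    return [img[r][c]
            for r in range(max(0, i - half), min(rows, i + half + 1))
            for c in range(max(0, j - half), min(cols, j + half + 1))]


def local_features(img, i, j, stroke_width, scales=(1, 2, 4, 8), eps=1e-6):
    """Local mean, local std and Su contrast for window sides of about k * s."""
    feats = {}
    for k in scales:
        vals = window(img, i, j, half=(k * stroke_width) // 2)
        hi, lo = max(vals), min(vals)
        feats[k] = {
            "avg": mean(vals),                   # Eq. (8)
            "std": pstdev(vals),                 # Eq. (9), population std
            "su": (hi - lo) / (hi + lo + eps),   # Eq. (10), eps is assumed
        }
    return feats
```

For a flat region both the standard deviation and the contrast are zero, while a dark stroke pixel on a bright background yields a contrast close to 1.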


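2) Exponential Truncated Niblack Index: To include Niblack's decision function in our considerations,
we first rearrange the terms in (3) according to $f_{\mathrm{Niblack}} = 0$, as shown below

$$k_{\mathrm{Niblack}}(i,j|R) = \left(I_{i,j} - \mu^{R}_{i,j}\right) / \sigma^{R}_{i,j} \tag{12}$$

and then compute a so-called Exponential Truncated Niblack Index (ETNI) feature as follows.

$$X^{\mathrm{ETNI}|R}_{i,j} = \begin{cases} \exp\{k_{\mathrm{Niblack}}(i,j|R)\}, & \text{if } k_{\mathrm{Niblack}}(i,j|R) < 0 \\ 1, & \text{otherwise} \end{cases} \tag{13}$$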
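A minimal sketch of the ETNI computation for one pixel is given below, assuming the window intensities have already been gathered into a list. The function names are hypothetical, and the small `eps` added to the standard deviation is an assumption introduced here to avoid division by zero in flat regions.

```python
import math
from statistics import mean, pstdev


def k_niblack(intensity, window, eps=1e-6):
    """Niblack parameter obtained by solving f_Niblack = 0, Eq. (12)."""
    return (intensity - mean(window)) / (pstdev(window) + eps)


def etni(intensity, window):
    """Exponential Truncated Niblack Index: exp(k) for k < 0, else 1."""
    k = k_niblack(intensity, window)
    return math.exp(k) if k < 0 else 1.0
```

Pixels darker than their local mean map into (0, 1), with darker pixels closer to 0, and all other pixels are truncated to 1.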

Fig. 1 compares an image in its original form with its corresponding ETNI feature space.

Fig. 1: ETNI features for image DIBCO2010_HW04. (a) Original image; (b) ETNI feature for R of size 64 × 64.

3) Logistic Truncated Sauvola Index: Similarly, we rearrange the terms in Sauvola's decision function
(4) according to $f_{\mathrm{Sauvola}} = 0$ for its key parameter $k_{\mathrm{Sauvola}}$ as follows,

$$k_{\mathrm{Sauvola}}(i,j|R) = \frac{I_{i,j}/\mu^{R}_{i,j} - 1}{\sigma^{R}_{i,j}/S_{\mathrm{Sauvola}} - 1} \tag{14}$$

Since $k_{\mathrm{Sauvola}}(i,j|R)$ could be anywhere in $(-\infty, \infty)$, we normalize this index by
using the logistic function shown in Eq. (15), and call it the Logistic Truncated Sauvola Index (LTSI),

$$X^{\mathrm{LTSI}|R}_{i,j} = \begin{cases} 1.0, & \text{if } \sigma^{R}_{i,j} > S_{\mathrm{Sauvola}} \\ \left(1 + \exp\{-k_{\mathrm{Sauvola}}(i,j|R)\}\right)^{-1}, & \text{otherwise} \end{cases} \tag{15}$$

where the range of $X^{\mathrm{LTSI}|R}_{i,j}$ is $[0, 1]$, and the condition
$\sigma^{R}_{i,j} < S_{\mathrm{Sauvola}}$ ensures the sign consistency of $k_{\mathrm{Sauvola}}(i,j|R)$.
LTSI thus reflects the Sauvola decision surface. A sample result of the LTSI feature is given in Fig. 2.
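The LTSI computation can be sketched along the same lines. Several choices below are assumptions made for this illustration and are not taken from the paper: the default $S_{\mathrm{Sauvola}} = 128$ (a conventional value for 8-bit images), the truncated branch returning 1.0 and being applied when $\sigma \ge S_{\mathrm{Sauvola}}$ so that the denominator never vanishes, the floor on the local mean, and the clamp on $k$ that keeps `math.exp` from overflowing.

```python
import math
from statistics import mean, pstdev


def k_sauvola(intensity, window, s_sauvola=128.0, eps=1e-6):
    """Sauvola parameter obtained by solving f_Sauvola = 0."""
    mu = max(mean(window), eps)   # assumed guard against an all-black window
    sigma = pstdev(window)
    return (intensity / mu - 1.0) / (sigma / s_sauvola - 1.0)


def ltsi(intensity, window, s_sauvola=128.0):
    """Logistic Truncated Sauvola Index in [0, 1]."""
    if pstdev(window) >= s_sauvola:   # truncated branch (>= is an assumed guard)
        return 1.0
    k = k_sauvola(intensity, window, s_sauvola)
    k = max(-50.0, min(50.0, k))      # assumed numerical clamp
    return 1.0 / (1.0 + math.exp(-k))
```

With these choices a pixel darker than its local mean gets a value above 0.5 and a brighter one a value below 0.5, which mirrors the sign of Sauvola's decision function.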
Fig. 2: LTSI features for image DIBCO2011_PR05. (a) Original image; (b) LTSI feature for R of size 8 × 8.


4) Logarithm Intensity Percentile Features: Intuitively, the darkness of a pixel is related to whether it 
is a text pixel. Given a region S, the percentile of the pixel’s intensity can be computed as 


$$\mathrm{perc}(i,j|S) = \frac{\sum_{p,q \in S} \mathbb{1}_{[0,\infty)}\left(I_{i,j} - I_{i+p,\,j+q}\right)}{\|S\|} \tag{16}$$


where $\mathbb{1}_{[0,\infty)}(t)$ denotes the indicator function whose value is 1 when $t \in [0,\infty)$
and 0 otherwise, and $\|\cdot\|$ denotes the cardinality function. It is clear that this percentile is a
type of rank feature, and thus is invariant to any monotonically increasing transform of the original
intensity space. To give a higher resolution to lower percentiles, we use the logarithm version of (16)
as shown in (17), and call it the Logarithm Intensity Percentile (LIP) feature. Here $Th_{\mathrm{perc}}$
is a threshold ($= 0.01$ in this paper).

$$X^{\mathrm{LIP}|S}_{i,j} = \begin{cases} 1.0, & \text{if } \mathrm{perc}(i,j|S) < Th_{\mathrm{perc}} \\ \log_{Th_{\mathrm{perc}}}\left(\mathrm{perc}(i,j|S)\right), & \text{otherwise} \end{cases} \tag{17}$$
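The percentile and its logarithmic version can be sketched as follows for a generic region given as a flat list of intensities. The function names are hypothetical, and the shape of the region (rows, columns, diagonals, or the whole image) is left to the caller.

```python
import math


def percentile(intensity, region):
    """Fraction of region pixels whose intensity does not exceed the given one."""
    return sum(1 for v in region if intensity - v >= 0) / len(region)


def lip(intensity, region, th_perc=0.01):
    """Logarithm Intensity Percentile: log base th_perc of the percentile."""
    p = percentile(intensity, region)
    if p < th_perc:
        return 1.0
    return math.log(p) / math.log(th_perc)
```

With $Th_{\mathrm{perc}} = 0.01$, percentiles 1.0, 0.1 and 0.01 map to 0, 0.5 and 1 respectively, so the darkest few percent of a region occupy half of the feature range. Because only the rank of the intensity matters, squaring all intensities, for example, leaves the feature unchanged.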

With regard to $S$, we make the parallelogram $S$ cover multiple rows, columns, diagonals, and inverse
diagonals. The number of rows, columns, diagonals and inverse diagonals in $S$ is made to be $k$ times the
estimated stroke width $s$. Finally, we also compute the LIP feature with respect to the entire image and
the maximum percentile among all previously extracted LIP features. Fig. 3 shows the original document
with its corresponding features in the LIP spaces. As one can see, the LIP space indeed provides more
discriminative power.

Fig. 3: LIP features for image DIBCO2011_HW1. (a) original image; (b) global LIP; (c)-(e) LIP along row, column,
and diagonal; and (f) max LIP of all directions.


5) Relative Darkness Index Features: Inspired by the great success of local ternary patterns (LTP) [22]
in face recognition, we borrow their essence here. LTP relies on the comparison of a center pixel's
intensity with each pixel in a set of neighbors $\{N_1, \cdots, N_k\}$ that lie on a circle of radius $r$,
and the $l$-th code in a length-$k$ code string is defined as

$$\mathrm{ltp}(D_{i,j}, l) = \begin{cases} +1, & \text{if } I_{i+r_l,\,j+c_l} \ge I_{i,j} + tol \\ -1, & \text{if } I_{i+r_l,\,j+c_l} \le I_{i,j} - tol \\ 0, & \text{if } |I_{i+r_l,\,j+c_l} - I_{i,j}| < tol \end{cases} \tag{18}$$

where $r_l$ and $c_l$ denote the relative coordinates of a neighbor $N_l$ w.r.t. the center pixel, and $tol$
is a preset tolerance. However, the number of possible LTP codes is often too large to encode effectively.
Though one may reduce this number by considering all shift-equivalent codes as one, or by separating a
ternary code into two binary codes, we find that the simple frequency count of each code value in a code
string already reveals many intrinsic properties of pixels, and we call these counts the Relative Darkness
Index (RDI) features. Precisely, given the code $C$ and neighbors on a circle of radius $r$, the RDI feature
is defined as


$$X^{\mathrm{RDI}|C,r}_{i,j} = \sum_{l=1}^{k} \frac{\mathbb{1}_{0}\left(\mathrm{ltp}(D_{i,j}, l) - C\right)}{k} \tag{19}$$
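The code frequencies of (19) can be sketched as below. The sampling of the circle (eight neighbors at rounded integer offsets) and the function names are assumptions made for this illustration, and the caller is expected to keep the center pixel at least `radius` pixels away from the image border.

```python
import math


def circle_offsets(radius, k=8):
    """k (row, col) offsets on a circle of the given radius, rounded to integers."""
    return [(round(radius * math.sin(2 * math.pi * l / k)),
             round(radius * math.cos(2 * math.pi * l / k)))
            for l in range(k)]


def ltp_codes(img, i, j, radius, tol, k=8):
    """Ternary code of each neighbor relative to the center pixel, Eq. (18)."""
    center = img[i][j]
    codes = []
    for dr, dc in circle_offsets(radius, k):
        diff = img[i + dr][j + dc] - center
        codes.append(1 if diff >= tol else -1 if diff <= -tol else 0)
    return codes


def rdi(img, i, j, radius, tol, k=8):
    """Relative frequency of each code value C in {-1, 0, +1}, Eq. (19)."""
    codes = ltp_codes(img, i, j, radius, tol, k)
    return {c: codes.count(c) / k for c in (-1, 0, 1)}
```

The three frequencies always sum to 1, so a homogeneous neighborhood puts all of its mass on code 0.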


As one can see from Fig. 4(c)-(e), most of the nearly homogeneous background parts have high code 0
indices; pixels close to strong edges are dominated by code +1 indices, and foreground text pixels have
a high response on code -1 indices. To further enhance RDI's discriminative power, we also compute the
ratio of one code to the sum of itself and another code (see Fig. 4(f)-(h)).

6) Other Features: Besides the features discussed above, we extract features from the global image
statistics, including the mean and standard deviation of the entire image's intensities, the mean and
standard deviation of the percentile image, the 32 bins of the normalized histogram (summing to 1) of
image intensities, and the 32 bins of a normalized logarithmic histogram.

Fig. 4: RDI features for image DIBCO2013_PR05 (darker pixels indicate a value close to 0). (a) original color
image; (b) original image; (c)-(e) RDI feature $X^{\mathrm{RDI}|C,8}$ for $C \in \{0, -1, +1\}$, respectively;
(f)-(h) ratios of one RDI code to the sum of itself and another code.

III. Training and Testing Settings 

In experiments, we use the widely accepted Document Image Binarization Contest (DIBCO) datasets from
2009 to 2014 as our training and testing data; they total 76 images. We adopt a leave-one-out strategy
in which we first pick the DIBCO image set of a particular year as our testing set, and use the rest as
our training set.

1) Feature Summary: We summarize all used features with their dimensions and corresponding normalization
considerations in Table I. Here, the stroke width $s$ can be estimated via various methods; we use
Su's method [3]. 'Scale' indicates the side of the local square region $R$.


TABLE I: Used Features

Type                         Scale            Dimension   Normalization
Local int.                   N/a              1           divide by 255
Otsu diff.                   N/a              1           divide by 255
Local avg./std.              1s,2s,4s,8s      4/4         divide by 255
Su/Howe                      1,1s,2s,4s       4/4         MinMax
ETNI/LTSI                    1s,2s,4s,8s      4/4         N/a
LIP                          1,1s,2s,4s,8s    1+4x4+1     N/a
RDI                          1,1s,2s,4s,8s    5x6         N/a
Global int. avg./std.        N/a              1/1         divide by 255
Global perc. avg./std.       N/a              1/1         N/a
Global int./perc. loghist.   N/a              32/32       N/a
Total                                         142
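As a quick consistency check on Table I, the per-type dimensions can be tallied directly. The dictionary below simply restates the Dimension column, reading an entry such as 4/4 as two feature types of four dimensions each.

```python
# Dimensions per feature type, restated from the Dimension column of Table I.
feature_dims = {
    "Local int.": 1,
    "Otsu diff.": 1,
    "Local avg./std.": 4 + 4,
    "Su/Howe": 4 + 4,
    "ETNI/LTSI": 4 + 4,
    "LIP": 1 + 4 * 4 + 1,
    "RDI": 5 * 6,
    "Global int. avg./std.": 1 + 1,
    "Global perc. avg./std.": 1 + 1,
    "Global int./perc. loghist.": 32 + 32,
}

total = sum(feature_dims.values())
print(total)  # 142
```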



2) Sampling Strategy: Selecting training samples is essential in this task. First, one may not be able to
handle a training set this large. These 76 images contain more than 80 million pixels in total. Assuming
each feature is stored in float32 format, we need 80 × 142 × 4 MB (≈ 45 GB) of memory just for the training
features, a requirement that is clearly beyond the capacity of most computers nowadays. Second, one
may notice that the training data are imbalanced. We know that both the background non-text class and the
foreground text class in the binarization problem actually cover different subclasses [23], while we also
know that nearly homogeneous background and foreground dominate our training data.

To solve both problems, we first artificially classify all pixels in an image into 16 subclasses, each
represented as a 4-bit string, where the bits indicate whether or not the pixel is in Otsu's foreground,
in Niblack's foreground, within $s$ pixels of the reference image edges, and in the reference annotated
foreground, respectively. We draw the same number of random samples from each subclass.
Fig. 5 illustrates the samples we extracted, which are balanced over both foreground and background subclasses.
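The subclass-balanced sampling can be sketched as follows. The function names, the bit order of the 4-bit code, and the handling of subclasses with fewer members than requested are assumptions made for this illustration; the four binary tests themselves are taken as given inputs.

```python
import random
from collections import defaultdict


def subclass_code(in_otsu_fg, in_niblack_fg, near_edge, in_reference_fg):
    """Pack the four binary tests into a subclass id in 0..15 (bit order assumed)."""
    return ((int(in_otsu_fg) << 3) | (int(in_niblack_fg) << 2)
            | (int(near_edge) << 1) | int(in_reference_fg))


def balanced_sample(codes, per_subclass, seed=0):
    """Draw up to `per_subclass` pixel indices from every non-empty subclass."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, code in enumerate(codes):
        by_class[code].append(idx)
    picked = []
    for code in sorted(by_class):
        members = by_class[code]
        # Small subclasses contribute all of their pixels (assumed policy).
        picked.extend(rng.sample(members, min(per_subclass, len(members))))
    return picked
```

If all 16 subclasses are populated, drawing 600 pixels per subclass yields 9,600 samples for an image.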



Fig. 5: Sampling strategy. (a) pixels with subclass labels for image DIBCO2012-HW02 (each color denotes
a subclass); (b) subclass-balanced samples extracted from the DIBCO2012-HW02 image (red/green dots indicate
background/foreground).

3) Training and Testing Strategies: In all of the following experiments, we perform a two-pass training.
We first extract 9,600 samples (subclass balanced) from each training image and train a simple classifier,
say Gaussian Naive Bayes. We use this classifier to decode all training images, extract an additional
9,600 erroneous samples (subclass balanced) from each image, and use all extracted samples to train a
more complicated sklearn ExtraTrees classifier [25]. Note that in total we extract 19,200 samples per
image, which account for only about 1.5% of all samples. Classifier parameters are obtained
from a 10-fold cross-validation using all samples. A final classifier is trained using all extracted
samples and the validated parameters. Fig. 6 plots the feature importance of each feature type in terms of
the overall contribution and the averaged per-dimension contribution. As one can see, RDI, Global int.
hist., and LIP are the three most useful feature categories in terms of overall contribution, and Su, LTSI
and RDI are the three best features in terms of per-dimension contribution.

In testing, we use the final classifier to predict the class label for all pixels in a testing image. Depending
on the size of an image, the decoding time may vary between 5 s and 30 s.
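The two-pass training can be sketched with scikit-learn as below. This is a simplified illustration under stated assumptions: the function name and its parameters are hypothetical, plain random sampling replaces the subclass-balanced sampling, the cross-validated parameter search is omitted, and the number of trees is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB


def two_pass_train(X_all, y_all, n_first, n_hard, seed=0):
    """Train a simple model, mine its errors, then train ExtraTrees on both sets."""
    rng = np.random.default_rng(seed)

    # Pass 1: a random subset trains a simple Gaussian Naive Bayes classifier.
    first = rng.choice(len(X_all), size=n_first, replace=False)
    simple = GaussianNB().fit(X_all[first], y_all[first])

    # Decode all pixels and keep misclassified ones that were not sampled yet.
    wrong = np.flatnonzero(simple.predict(X_all) != y_all)
    wrong = np.setdiff1d(wrong, first)
    if len(wrong) > 0:
        hard = rng.choice(wrong, size=min(n_hard, len(wrong)), replace=False)
    else:
        hard = np.empty(0, dtype=first.dtype)

    # Pass 2: train the final classifier on the union of both sample sets.
    idx = np.concatenate([first, hard])
    forest = ExtraTreesClassifier(n_estimators=50, random_state=seed)
    forest.fit(X_all[idx], y_all[idx])
    return forest, idx
```

Here `X_all` is an array of per-pixel feature vectors and `y_all` the corresponding 0/1 labels; the fitted forest's `feature_importances_` attribute gives per-dimension importances of the kind summarized in Fig. 6.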

IV. Experimental Results 

Fig. 6: Feature importance. Left: overall importance of each feature type; Right: per-dimension feature
importance (%) for each feature type.

1) Performance on DIBCO Datasets: Table II lists the performance of our proposed supervised binarization
solution on the DIBCO 2012, 2013, and 2014 datasets using the standard metrics F1-score,
peak signal-to-noise ratio (PSNR), and distance reciprocal distortion (DRD); metric definitions can be
found in the DIBCO contest reports. As we can see, our performance is comparable to that of the top five
methods. We also notice that our binarization classifier's performance is very stable across all three
datasets; in particular, it always keeps the DRD score below 3. Sample decoding results are
compared to the top two contest methods in Fig. 7. As one can see, our supervised solution successfully
learned to handle difficult cases such as 1) faded text and 2) text on a dirty background.
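For reference, two of the reported metrics can be computed from a predicted and a reference binary image as sketched below. The function name is hypothetical, labels are assumed to be 0/1 with 1 denoting foreground text, and the PSNR uses a foreground/background difference of 1; DRD is omitted because it requires a weighted neighborhood analysis.

```python
import math


def f1_and_psnr(pred, truth):
    """Return (F1 in percent, PSNR in dB) for two 2-D lists of 0/1 labels."""
    tp = fp = fn = diff = n = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            tp += (p == 1 and t == 1)
            fp += (p == 1 and t == 0)
            fn += (p == 0 and t == 1)
            diff += (p != t)
            n += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    mse = diff / n
    psnr = math.inf if mse == 0 else 10 * math.log10(1.0 / mse)
    return 100.0 * f1, psnr
```

A prediction identical to the reference gives an F1 of 100% and an infinite PSNR; every mislabeled pixel lowers both.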

TABLE II: Performance Evaluations on DIBCO Datasets

Dataset      Method                    Contest Rank   F1 (%)   PSNR    DRD
DIBCO 2012   -                         1              89.47    21.80   3.400
DIBCO 2012   Lelore et al.'s           2              92.85    20.57   2.660
DIBCO 2012   -                         3              91.54    20.14   3.048
DIBCO 2012   Nina's                    4              90.38    19.30   3.348
DIBCO 2012   Yazid et al.'s            5              91.85    19.65   3.056
DIBCO 2012   Ours                                     92.01    19.92   2.601
DIBCO 2013   Su et al.'s method        1              92.12    20.68   3.100
DIBCO 2013   -                         2              92.70    21.29   3.180
DIBCO 2013   -                         3              91.81    20.68   4.020
DIBCO 2013   -                         4              91.69    20.54   3.590
DIBCO 2013   -                         5              90.92    19.32   3.910
DIBCO 2013   Ours                                     91.40    20.13   2.637
DIBCO 2014   Mesquita et al.'s         1              96.88    22.66   0.902
DIBCO 2014   -                         2              96.63    22.40   1.001
DIBCO 2014   -                         3              93.35    19.45   2.194
DIBCO 2014   Ziaratban et al.'s        4              89.24    18.94   4.502
DIBCO 2014   Mitianoudis et al.'s      5              89.77    18.49   4.502
DIBCO 2014   Ours                                     92.69    19.47   2.571


Fig. 7: Binarization results. (a) original images; (b) reference binarized images (highlighted red regions
indicate disagreements); (c) results of contest rank 1; (d) results of contest rank 2; and (e) our results.

2) Learning Curve: As mentioned previously, only about 1.5% of all available training samples are
used in our experiments. We investigate the relationship between the number of training samples and the
binarization performance using the test set of DIBCO 2012 in Table III. As in many pattern recognition
problems, the improvement in binarization performance gets smaller as the number of samples increases.


TABLE III: Performance vs. Training Samples

#Samples   1,920   5,760   9,600   13,440   15,360   17,280   19,200
F1 (%)     91.47   91.77   91.90   91.93    91.95    92.01    92.01
PSNR       19.64   19.81   19.86   19.86    19.88    19.93    19.92
DRD        2.797   2.689   2.637   2.634    2.618    2.599    2.601


3) Document Binarization in the Wild: Although the images in the DIBCO datasets already cover a
wide range of variations, there are clearly more variations and combinations of variations that are not
included in the DIBCO training data. We therefore test our learned classifier on out-of-domain document
images, and we observe satisfactory results (see Fig. 8).

Fig. 8: Binarization results on out-of-domain data.

V. Conclusion 


In this paper we investigate a document binarization solution based on supervised learning. Unlike previous
efforts, this solution is parameter-free and fully trainable. Our experimental results show that one can
learn a reasonably good binarization decision function from a small set of carefully selected training
data. Such a learned decision function not only works well for in-domain data, but also applies to
out-of-domain data. In future work, we will explore several interesting aspects such as discriminative
features (e.g., image moments and connected component attributes) and on-the-fly classifier adaptation.